Abstract We observe the statistical properties of blogs that are expected to reflect
social human interaction. First, we introduce a basic normalization preprocess that
enables us to evaluate the genuine word frequency in blogs independent of
external factors such as spam blogs, server breakdowns, growth in the population
of bloggers, and periodic weekly behaviors. After this process, we confirm that
low-frequency words clearly follow an independent Poisson process, as theoretically
expected. Second, we focus on each blogger's basic behaviors and find that there
are two kinds of bloggers, Poissonian and non-Poissonian. Further, Zipf's law on word
frequency is confirmed to hold universally, independent of individual activity types.
1 Introduction

Blogs are a new kind of social communication medium in which personal opinions 
can be easily uploaded on the Web. A typical blog site is maintained by an individual 
or a small group. Blog users are called bloggers and they post blog "entries" that 
are freely written texts like those in diaries. These texts include opinions on movies, 
evaluations of purchased items and announcements of social events. Thus, each word 
in blog entries may reflect social phenomena. Search engine technologies have been 
developed to observe the details of blog entries automatically at high speeds. In this 
paper, we focus on the statistics of blogs from both macroscopic and microscopic 
perspectives. 
2 Data description

Using a search engine similar to "Google Blog Search,"¹ we analyzed Japanese blog
databases that were collected from January 1st 2007 to December 31st 2008 by
"Dentsu Buzz Research."² For given keywords, observation period, and search area,
the search engine automatically lists all entries that fulfill the conditions. The search
engine covers 20 major blog providers in Japan that host more than 10 million blog
sites. The total number of observed entries is more than 610 million, and about
800,000 new entries are uploaded daily on average. While we focus only on Japanese
blogs in this paper, the share of Japanese blog sites was the largest in 2007, about 37%,
followed by English and Chinese blog sites, according to the report
of Technorati,³ an internet search engine company for blogs (Technorati 2007). In
this paper, we first focus on the temporal change of word frequency in blog entries.
Specifically, we count the number of blog entries that include a target keyword at least
once. Namely, if one blog entry includes the target keyword two or more times, we
still count that entry only once. We randomly choose words from a dictionary
for Japanese morphological analysis,⁴ which is widely used in the field of natural
language processing.
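The counting rule above (an entry contributes at most one count per keyword, however often the keyword occurs in it) can be sketched as follows. This is a minimal illustration, not the search engine's implementation: the function name `daily_entry_counts`, the `(day, text)` input layout, and plain substring matching are assumptions; real Japanese text would first need morphological analysis.

```python
from collections import Counter
from datetime import date


def daily_entry_counts(entries, keyword):
    """Count, per day, the entries that contain `keyword` at least once.

    `entries` is an iterable of (day, text) pairs (hypothetical layout).
    """
    counts = Counter()
    for day, text in entries:
        # Presence test only: repeated hits inside one entry still count once.
        if keyword in text:
            counts[day] += 1
    return counts


entries = [
    (date(2007, 1, 1), "pollen pollen everywhere"),  # two hits, one entry
    (date(2007, 1, 1), "nothing relevant here"),
    (date(2007, 1, 2), "pollen forecast"),
]
counts = daily_entry_counts(entries, "pollen")
```

The resulting per-day counts play the role of the daily time series analyzed in the following sections.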



3 Noise reductions 

In this section, we introduce a basic procedure to evaluate the genuine word frequency 
in blogs independent of external factors. 



3.1 Effect of spam blogs 

While reading the collected blog entries, we easily find entries
that are obviously not generated by humans. For example, some blog entries
consist of a meaningless sequence of words, articles copied from major
internet news sites, or simply repeated advertisement keywords. Further, some
entries contain sexual or violent content that leads to a paid-membership site. Collectively,
these examples are called spam blogs. Some spam blogs are created with the
intention of enhancing their ranking under measures such as PageRank (Page et al. 1998). A
large amount of spam is generated daily, and it causes heavy fluctuations in word
frequencies.

In the study of blog analysis, spam is attracting considerable interest, and thus,
various methods for the detection of spam have been developed (MIC 2009; Narisawa
et al. 2006; Sato et al. 2008). In the search engine of Dentsu Buzz Research, the
following spam filters are installed:

¹ http://blogsearch.google.com/
² http://www.dbuzz.jp/
³ http://technorati.com/
⁴ http://chasen.naist.jp/
Fig. 1 Time series of all collected blog entries after spam filtering, X(t), and the time series without filtering
(dashed line)

- word salad: Blog contents are a mixture of seemingly meaningful words that 
together signify nothing. 

- copy & paste: Blog contents are automatically or manually excerpted from other 
sources. 

- template: Blog entry comprises template sentences and fixed keywords. 

- multi post: Identical blog entries are posted to different blog sites. 

- adult and gamble: Blog entry contains adult or gambling contents. 

In this paper, we use these spam filters to categorize spam and normal blogs. In
a benchmark test of the filters on 200 blog entries, the overall detection
accuracy was 83%. About 40% of the collected blog entries were categorized as
spam.



3.2 Effect of system maintenance and population growth 

Since blogs are supported by computer systems, some blog
servers may suddenly stop working because of maintenance or hardware replacement, and
thus, word frequency may drop suddenly for a period. Moreover, the
total number of blog sites tends to increase almost monotonically (MIC
2009). Hence, the average number of appearances of any word tends to increase in
a non-stationary way. Here, we introduce a procedure for adjusting these external or
systemic non-stationary effects.

For a given time series of flux fluctuations, Menezes and Barabasi introduced a
method of separating the external noise effect from internal contributions in an open
system modeled as a complex network (Menezes and Barabasi 2004). Utilizing the fact that
the flux time series comprise independent small parts of fluxes, they computed the share
of each small part in the entire range of collected fluxes. For their
mathematical models, the method successfully separates external noise
from the time series. However, in our case, we cannot assume that each blogger acts
independently. Therefore, we introduce a revised method for the separation of
internal and external fluctuations.
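As a point of comparison, the share-based separation attributed to Menezes and Barabasi can be sketched as follows. This is a minimal illustration under the usual reading of that method (each component's expected external part is its overall share of the total activity times the total activity at each time step); the function name and the `flux[i][t]` layout are assumptions, and it is not the revised method introduced in this paper.

```python
def split_internal_external(flux):
    """Split flux[i][t] into internal and external parts by activity share.

    The external part of component i at time t is taken as
    share_i * (total activity at t), where share_i is the fraction of all
    activity carried by component i over the whole observation period.
    """
    n_steps = len(flux[0])
    total_per_step = [sum(series[t] for series in flux) for t in range(n_steps)]
    grand_total = sum(total_per_step)

    internal, external = [], []
    for series in flux:
        share = sum(series) / grand_total
        ext = [share * total for total in total_per_step]
        external.append(ext)
        # Whatever is not explained by the system-wide fluctuation is internal.
        internal.append([x - e for x, e in zip(series, ext)])
    return internal, external


# Two components that simply follow the system-wide trend: no internal part.
internal, external = split_internal_external([[1, 2, 3], [3, 6, 9]])
```

The split relies on the components fluctuating independently, which is the assumption the text argues does not hold for bloggers.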
Fig. 2 Relation between the average frequency $\langle x_j \rangle$ and the correlation $C_j$ (circles)
and $C'_j$ (crosses).

For words with low frequency, we assume that bloggers pay little attention to
these words and that their blog entry numbers are not significantly affected by external
factors. For words with high frequency, we conjecture that bloggers focus on their
appearance numbers, and thus, these words are significantly affected by external factors.
Based on these assumptions, we expect that the contribution of external factors depends
on the average value of word frequency. For any given keyword $j$, we calculate
the average value of the time series of daily blog entries, $x_j(t)$, and the correlation coefficient
between $x_j(t)$ and the time series of all collected blog entries after spam filtering, $X(t)$,
shown in Fig. 1:

$$C_j = \frac{\sum_{t=0}^{T} \left(X(t) - \langle X \rangle\right)\left(x_j(t) - \langle x_j \rangle\right)}{\sqrt{\sum_{t=0}^{T} \left(X(t) - \langle X \rangle\right)^2 \, \sum_{t=0}^{T} \left(x_j(t) - \langle x_j \rangle\right)^2}}, \tag{1}$$

where $\langle X \rangle$ and $\langle x_j \rangle$ are the averages of $X(t)$ and $x_j(t)$ over $t = 0, \dots, T$. As shown in Fig. 2, we confirm
that a clear positive correlation exists between $\langle x_j \rangle$ and $C_j$, in contrast to the case
in which the blog timestamps are randomly shuffled, $C'_j$. Once we obtain $C_j$ from $\langle x_j \rangle$, we
define a new normalization of the time series, $F_j(t)$, as follows.



As shown in Fig. 2, for a small value of $\langle x_j \rangle$, the value of $C_j$ is also small. In this case,
the first term of Eq. (2) can be neglected and we have $F_j(t) \approx x_j(t)$. On the other hand,
for a large value of $\langle x_j \rangle$, the second term on the right-hand side of Eq. (2) becomes
negligible and we have $F_j(t) \approx \frac{x_j(t)}{X(t)} \langle X \rangle$. An example demonstrating the effect of this
normalization is shown in Fig. 3. We confirm that systematic fluctuations are reduced
in the time series (Fig. 3).
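The correlation coefficient $C_j$, its shuffled baseline $C'_j$, and a normalization with the two limits described above can be sketched as follows. This is a hedged illustration only: the blend $C_j \frac{x_j(t)}{X(t)} \langle X \rangle + (1 - C_j)\, x_j(t)$ used in `normalize` is an assumption chosen because it reproduces both limits, not a confirmed form of Eq. (2). The function names are hypothetical, and permuting the daily word series stands in for shuffling individual blog timestamps.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def shuffled_correlation(total, word, seed=0):
    """Baseline: correlation after randomly permuting the word series in time."""
    rng = random.Random(seed)
    permuted = list(word)
    rng.shuffle(permuted)
    return pearson(total, permuted)


def normalize(total, word, c_j):
    """ASSUMED blend: c_j * x_j(t) * <X> / X(t) + (1 - c_j) * x_j(t)."""
    mean_total = sum(total) / len(total)
    return [
        c_j * x * mean_total / big_x + (1.0 - c_j) * x
        for big_x, x in zip(total, word)
    ]


total = [100.0, 200.0, 400.0, 100.0]   # X(t): all filtered entries per day
word = [10.0, 20.0, 40.0, 10.0]        # x_j(t): entries containing keyword j
c_j = pearson(total, word)
flattened = normalize(total, word, c_j)
```

In this toy case the keyword follows the total exactly, so the normalized series is flat; with `c_j = 0` the function returns the raw series unchanged.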
Fig. 3 An example of a word with high frequency (keyword: "if") before and after the normalization
($\langle F_j \rangle = 1241.3$)
Fig. 4 Ratio for each day of the week; keywords are "hospital" and "drive"



3.3 Effect of weekly period 

There are words that exhibit clear periodic behaviors depending on the day of the week.
For example, "hospital", "office", and "school" are typical words that appear more
frequently on weekdays. To flatten such a weekly period, we sum
the number of appearances of a word for each day of the week, $N_j(k)$, $k = 0, 1, \dots, 6$,
where $k = 0$ means Sunday, $k = 1$ means Monday, etc. Then, the time series of word
frequency, $F_j(t)$, is normalized in the following way:

$$\tilde{F}_j(t) = \frac{F_j(t)}{N_j(t \bmod 7)} \cdot \frac{N_j}{7}, \tag{3}$$

where $\tilde{F}_j(t)$ denotes the normalized series and $N_j = \sum_{t=0}^{T} F_j(t)$ is the total number of blog entries. Note that for most words,
such week dependence normalization is not necessary.
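The weekly flattening can be sketched as below. This is a minimal illustration that assumes the series starts on the weekday labeled k = 0 and that each value is divided by its weekday total and multiplied by one seventh of the grand total; the function name `flatten_weekly` is hypothetical.

```python
def flatten_weekly(series):
    """Rescale each day so that all seven weekdays carry equal total weight."""
    weekday_totals = [0.0] * 7
    for t, value in enumerate(series):
        weekday_totals[t % 7] += value       # per-weekday total, k = t mod 7
    grand_total = sum(weekday_totals)        # total over the whole period

    return [
        value * (grand_total / 7.0) / weekday_totals[t % 7]
        if weekday_totals[t % 7] else value  # leave days of empty weekdays as-is
        for t, value in enumerate(series)
    ]


# Two weeks in which the first weekday is twice as busy as the others.
series = [2, 1, 1, 1, 1, 1, 1] * 2
flat = flatten_weekly(series)
```

When every weekday has a nonzero total, the rescaling preserves the overall number of entries and only redistributes it across weekdays.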



3.4 Result of noise reduction 

With the noise reduction procedures introduced in Secs. 3.1-3.3, we confirm that
systematic noises are removed and that the time series appears more stationary. For
words with low frequency, specifically, words that appear less than once per day on
average, we confirm that the autocorrelation stays within the 95% confidence band, and the
distribution of intervals between appearances passes the statistical test for a Poisson
process, as demonstrated in Fig. 5. The $\chi^2$ test shows that the Poisson hypothesis is not
rejected at the 2.5% significance level. Of 300 words randomly selected from the
Japanese morphological analysis dictionary, only one word, "angstrom", passed the
$\chi^2$ test, while the remaining words appeared more than once per day on average.

Fig. 5 An example of a time series (a) and frequency distribution (b) for a low-frequency word ($\langle F_j \rangle = 0.7$);
keyword is "angstrom"
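The autocorrelation check for low-frequency words can be sketched as follows. This is an illustration under the common assumption that, for an uncorrelated series of length n, the sample autocorrelation at nonzero lags lies within about ±1.96/√n in 95% of cases; the function names are hypothetical, and the series must not be constant.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (series must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


def lags_outside_band(series, max_lag):
    """Lags >= 1 whose autocorrelation leaves the 95% white-noise band."""
    band = 1.96 / math.sqrt(len(series))
    acf = autocorrelation(series, max_lag)
    return [lag for lag in range(1, max_lag + 1) if abs(acf[lag]) > band]


# A strictly alternating series is strongly (anti)correlated at every lag.
alternating = [1.0, 0.0] * 50
acf = autocorrelation(alternating, 2)
violations = lags_outside_band(alternating, 2)
```

A series compatible with an independent Poisson process would be expected to produce few or no violations, whereas the alternating toy series violates the band at every lag.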

In a pioneering study of the basic statistics of blogs, Lambiotte et al. (2007)
reported that the Poissonian hypothesis is always rejected; however, they did not apply
systematic noise reduction procedures. For words with high frequency, we generally find a
clear deviation from the simple Poisson process, as already reported (Lambiotte et
al. 2007), even when these noise reduction procedures are fully applied. This implies that
there is potential interaction among bloggers.

4 Bloggers' individual properties 

So far, we have discussed word frequency in blogs from a macroscopic point
of view. In this section, we focus on the individual properties of bloggers, which are 
expected to form the base for the development of agent-based modeling of blogs. 

4.1 Intervals of posting

In this section, we focus on the bloggers' posting behavior. We analyze
data in which individual bloggers' entries are recorded with timestamps
of one-second precision from November 1st 2006 to March 31st 2009. If a blogger's
behavior is approximated by an independent Poisson process, the distribution of intervals
is approximated by an exponential function and the autocorrelation is almost
0. From this theoretical viewpoint, we categorize the bloggers into two cases:

- Case 1: Poissonian bloggers

- Case 2: Non-Poissonian bloggers

In Fig. 6, we show two typical examples belonging to these two cases. In Case 1,
the observed time series of postings (the top figure of Fig. 6a) is characterized by a
quick decay of the autocorrelation function (Fig. 6b), and the distribution of intervals
is well approximated by an exponential function (Fig. 6c). Therefore, an independent
Poisson process can form the base of the behavior for such bloggers. On the
other hand, in Case 2, the occurrence time series clearly shows clustering (the bottom
figure of Fig. 6a), and the autocorrelation decays slowly (Fig. 6d). Furthermore, the
interval distribution has a fat tail that is approximated by a power law (Fig. 6e). From
this example, we find that a non-Poissonian blogger possesses strong memory in that
once he (or she) posts an entry, he (or she) tends to continue posting entries. In this
analysis, we classify 110 bloggers into these two cases. There are 10 Poissonian bloggers,
and the remaining 100 bloggers are categorized into the non-Poissonian case by the
Kolmogorov-Smirnov test applied to the sequence of posting time intervals. We also
confirm that the autocorrelation functions of Poissonian bloggers are always within the 95%
confidence bands. Additionally, we checked 9753 bloggers during the observation
period from November 1st 2006 to July 10th 2009 with the Kolmogorov-Smirnov
test, and we confirmed that 1089 (about 11%) are categorized as Poissonian
bloggers. In contrast to the keyword appearance described in Sec. 3.4, the behaviors
of bloggers cannot be simply categorized by the average number of entries; that is,
both kinds of bloggers are found in every posting-rate group.
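A Kolmogorov-Smirnov check of posting intervals against an exponential law can be sketched as follows. This is an illustration, not the classification procedure actually used: it assumes the reference distribution is an exponential with the sample's own mean, uses the standard asymptotic Kolmogorov series for the p-value, and the function name `ks_exponential` is hypothetical. Because the rate is fitted from the same data, the p-value is optimistic and should be read as a rough guide.

```python
import math


def ks_exponential(intervals):
    """Return (D, p): KS distance to an exponential law with the sample mean."""
    xs = sorted(intervals)
    n = len(xs)
    rate = n / sum(xs)                      # fitted rate = 1 / mean interval

    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-rate * x)
        # Compare the model CDF with the empirical step just after and before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)

    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return d, 1.0                       # series is ill-behaved for tiny lam
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


# Intervals placed exactly at exponential quantiles: close to Poissonian.
n = 200
regular = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
d_regular, p_regular = ks_exponential(regular)

# Bursts of short intervals separated by long silences: clustered posting.
clustered = [1.0] * 190 + [1000.0] * 10
d_clustered, p_clustered = ks_exponential(clustered)
```

Under this sketch, a blogger would be labeled Poissonian when the test does not reject the exponential law, and non-Poissonian otherwise.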

4.2 Individual word frequency 

In this section, we investigate the frequency of words in individual blog entries. The
study of the frequency of words was started in the 1930s by the linguist Zipf, who counted
the number of appearances of words in various documents. He determined the rank
by sorting the words with respect to frequency and found an empirical law that
the frequency of a word is approximately proportional to the inverse of its rank (Zipf
1949). This old law, generally called "Zipf's law," still attracts considerable interest
among scientists of various fields because it is applicable not only to linguistic
problems but also to a wide variety of phenomena such as the incomes of companies
(Okuyama et al. 1999) and the abundance distribution of expressed genes (Furusawa
and Kaneko 2003).

In Fig. 7, word frequency distributions are plotted for both Poissonian and non- 
Poissonian bloggers, and we can find that Zipf's law holds in both cases. High fre- 
quency words are mainly postpositional particles and auxiliary verbs that commonly 
appear in all blog sites. Further, some topical keywords also appear frequently in 
each blogger's entries. In the case of words with low frequency, there are no com- 
mon words and the words depend on each blogger's characteristics. It is noteworthy 
that even under such non-uniformity, Zipf's empirical law holds for each blogger's 
entries. 
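The rank-frequency relation can be checked with a short sketch such as the following. It assumes the blogger's text has already been split into word tokens, and it estimates the exponent as the least-squares slope of log frequency against log rank, which is one simple choice among several; the function name `zipf_exponent` is hypothetical.

```python
import math
from collections import Counter


def zipf_exponent(words):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )


# Synthetic tokens whose frequencies are exactly 60 / rank for ranks 1..6.
words = []
for rank in range(1, 7):
    words += [f"w{rank}"] * (60 // rank)
slope = zipf_exponent(words)
```

A slope close to -1 corresponds to the guideline described for Fig. 7.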

5 An application of blogs 

Finally, we introduce an example showing that blogs can efficiently capture certain
kinds of social phenomena. Special keywords such as "flu" and "pollen allergy" appear
periodically every year with sharp peaks, as shown in Fig. 8a. We find that the
number of blog entries including the keyword "pollen" is closely associated with
the amount of airborne pollen in Tokyo.⁵ Furthermore, this sharp rise and decay is
well approximated by exponential functions, as plotted in Fig. 8b. A recent paper by
Ginsberg et al. (2009) introduced a method of detecting influenza epidemics by using
large numbers of Google search queries to track influenza-like illness. There is a
possibility that blogs can also be used as an observation tool for the development of
epidemics or allergies.

Fig. 6 Comparison of two cases of posting blog entries (a); Case 1 is well approximated by a Poisson
process and Case 2 reveals non-trivial correlation; autocorrelation function (b), (d); and cumulative
distribution of posting intervals (c), (e)
Fig. 7 Word frequency distribution of different bloggers. The guideline follows a power law with an
exponent of -1
Fig. 8 Time series of blog entries including the word "pollen" and the amount of airborne pollen (a). Time series after
the peak $t_c$ on a semi-log scale (b)
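An exponential rise or decay appears as a straight line on a semi-log plot, so its rate can be estimated by a linear fit to the logarithm of the counts. The sketch below illustrates this under the assumption that all counts are positive; the function name `fit_exponential` and the synthetic data are hypothetical and do not reproduce the pollen series.

```python
import math


def fit_exponential(times, counts):
    """Fit counts ~ a * exp(b * t) by least squares on log(counts)."""
    logs = [math.log(c) for c in counts]    # requires strictly positive counts
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    b = (
        sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
        / sum((t - mean_t) ** 2 for t in times)
    )
    a = math.exp(mean_l - b * mean_t)
    return a, b


# Synthetic decay after a peak: 100 entries on day 0, rate -0.5 per day.
times = list(range(10))
counts = [100.0 * math.exp(-0.5 * t) for t in times]
a, b = fit_exponential(times, counts)
```

A negative `b` describes the decay after the peak, and the same fit applied to the days before the peak would give a positive `b` for the rise.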



6 Discussion and conclusion 

In this paper, we proposed a basic preprocess for the separation of external systematic
noises from the time series of blog entries. After this normalization procedure,
we confirmed that, as theoretically expected, the appearance of low-frequency words
clearly follows a Poisson process. With regard to words with high frequency, the
Poissonian assumption does not hold in any case, implying the existence of a strong
non-trivial correlation among those words. We focused on each blogger's behavior
of posting blog entries and found that bloggers can be categorized into two cases:
Poissonian and non-Poissonian bloggers. About 10% of bloggers (10 of 110 and 1089
of 9753 in Sec. 4.1) belong to the Poissonian case, in which basic behaviors can be
modeled by an independent Poisson process. The rest of the bloggers tend to behave
in an intermittent manner with a strong memory effect. In any case, the word frequency
follows Zipf's law for each individual.

We expect that these basic results will play an important role in the construction 
of a microscopic agent-based model of bloggers in the near future. One of the targets 
of such a microscopic model will be the explanation of the macroscopic behaviors re- 
lated to the exponential rise and decay of commonly used keywords such as "pollen" 
shown in Sec. 5. 



⁵ http://kafun.jaanet.org/






Acknowledgement 

The authors are grateful to Dentsu Inc. and Hottolink Inc. for
providing blog data.

References 

Furusawa C, Kaneko K (2003) Zipf's Law in Gene Expression. Phys Rev Lett
90:088102

Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L
(2009) Detecting Influenza Epidemics Using Search Engine Query Data. Nature
457:1012-1014

Lambiotte R, Ausloos M, Thelwall M (2007) Word Statistics in Blogs and RSS Feeds:
Towards Empirical Universal Evidence. J Informetrics 1:277-286

Menezes MA, Barabasi A-L (2004) Separating Internal and External Dynamics
of Complex Systems. Phys Rev Lett 93:068702

Ministry of Internal Affairs and Communications (MIC) (2009) Report of Institute for
Information and Communication Policy [in Japanese]

Narisawa K, Yamada Y, Ikeda D, Takeda M (2006) Detecting Blog Spams using
the Vocabulary Size of All Substrings in Their Copies. Proc. of the 3rd Annual
Workshop on Weblogging Ecosystem

Okuyama K, Takayasu H, Takayasu M (1999) Zipf's Law in Income Distribution of
Companies. Physica A 269:125-131

Page L, Brin S, Motwani R, Winograd T (1998) The PageRank Citation Ranking:
Bringing Order to the Web. Stanford Digital Library Technologies Project

Sato Y, Utsuro T, Fukuhara T, Kawada Y, Murakami Y, Nakagawa H, Kando N (2008)
Analysing Features of Japanese Splogs and Characteristics of Keywords. Proc. of
the 4th International Workshop on Adversarial Information Retrieval on the Web,
pp 33-40

Technorati (2007) The State of the Live Web, April 2007. In: Sifry's Alerts. Available via
http://www.sifry.com/alerts/archives/000493.html. Accessed April 30th, 2009

Zipf GK (1949) Human Behavior and the Principle of Least Effort. Addison-Wesley,
Cambridge



